ABSTRACT

In this article I present IEAD, a new interface for astronomical science databases. It is based
on a powerful, yet simple, syntax designed to completely abstract the user from the structure of 
the underlying database. The programming language chosen for its implementation, JavaScript, 
makes it possible to interact directly with the user and to provide real-time information on the 
parsing process, error messages, and the name resolution of targets; additionally, the same parsing 
engine is used for context-sensitive autocompletion. Ultimately, this product should significantly 
simplify the use of astronomical archives, inspire more advanced uses of them, and allow the user 
to focus on what scientific research to perform, instead of on how to instruct the computer to do 
it. 



Subject headings: Astronomical databases: miscellaneous 



1. Introduction 

Nowadays, a well-maintained and easily accessible
data archive is critical to the success of a mid-to-large
telescope facility. This is best appreciated
if one looks at the large number of purely archival
articles, i.e. articles written using data from observations
that were not proposed by any of the
authors. For example, as noted already by Walsh
& Hook (2006), approximately half of the publi- 
cations based on Hubble Space Telescope data are 
purely archival. Furthermore, archival research is 
of course largely dominant for surveys and for ded- 
icated telescopes, such as the Sloan Digital Sky 
Survey (York et al. 2000) or the Two Micron All 
Sky Survey (Skrutskie et al. 2006). 

Astronomical archives are very complex. On 
one hand, the internal structure of the database 
is often complicated by the need to collect in- 
trinsically different kinds of data, taken for dif- 
ferent purposes by different instruments. On the 
other hand, the archive interface must serve tech- 
nical customers, the astronomers, who sometimes 
need to perform very specific queries. As a result,
typical archive interfaces (almost always accessible
through a dedicated World Wide Web
page) are often cluttered with a long list of different
fields and buttons, so as to accommodate
queries from the most demanding (and technically
inclined) users.

Unfortunately, this also means that many
archive interfaces are clumsy and unfriendly
for the large majority of users. For example, a recent
survey among the ESO Science Archive users
(Delmotte et al. 2006; see also the complete survey at
http://archive.eso.org/archive/stats/survey/survey_results.html)
has shown that
the most requested improvements for the archive
interface are the possibility to perform more complex
queries (23%), an easier-to-use interface
(20%), and a less dense main-query page (17%).
Clearly, these three suggestions cannot be followed
at the same time using a classical interface.

Recently, an increasing number of astronomical
archive interfaces have started to accept queries written
in SQL or extensions of this language (such as
ADQL, Ortiz et al. 2008). Unfortunately, while
presenting a clean query page, these solutions require
the user to know the internal structure of the
database at a relatively deep level, and they
ultimately make the database inaccessible to
less technically inclined users. Additionally, since
different astronomical databases often have completely
different structures, one is forced to learn,
for each archive used, very specific details that are
of no use in other contexts.

However, perhaps the most serious limitation of 
the currently available interfaces to astronomical 
databases is the fact that they force the user to ex- 
press the query in a form that (in all cases) strictly 
reflects the structure of the database. Ideally,
instead, the user should focus on the science, and
the interface should be flexible enough to allow a
direct formulation of the user's wishes.

In this article I present IEAD, or Interface for
the Exploration of Astronomical Databases, a new
query interface for astronomical science archives
that solves the limitations discussed above. The
concepts presented here have been developed
for the Hubble Space Telescope (HST) science
archive, but could be applied equally well, without
significant modifications, to any astronomical
data archive (and the software has been developed
with this aim in mind). IEAD is currently available
(integrated within a more standard query
interface) at the ST-ECF HST Archive
(http://archive.eso.org/archive/hst/search) and
at the CADC HST Archive
(http://www.cadc.hia.nrc.gc.ca/hst/new) under the name
"one-line query".

The paper is organized as follows. In Sect. 2 I 
present the basic ideas behind the IEAD system
and I briefly introduce its main features. A few 
user aids (including interactive parsing and auto- 
completion) are discussed in Sect. 3. The full gen- 
eral syntax is described in detail in Sect. 4, and 
its implementation in Sect. 5. In Sect. 6 possible 
extensions of this research are presented together 
with some of the difficulties that one might en- 
counter. Finally, the paper is closed with a quick 
summary (Sect. 7). 

2. IEAD: the concept

With the advent of powerful and "smart" search
engines, we are all used to the idea that simple interfaces
should be provided to perform recurrent
tasks. However, the apparent simplicity of
these search engines hides many complex
tasks that are performed behind the scenes: for
example, some web search engines allow queries
to be formulated using natural languages, or with
boolean operators, or using special keywords to
restrict the outputs. In general, it appears that
the current focus in the development of search engines
is to adapt them to the user's needs, rather than
to force the user to adapt to their designs and limitations.

A few search engines are designed to be able to
perform both simple and (very) complex queries.
Examples can be found in many Google products
(the standard web search engine, but also
the search interface for messages in GMail or for
RSS in the Google RSS reader) and on many
e-commerce sites (such as Amazon). In the astronomical
field, a similar but much simpler product
can be found in the one-line NASA ADS
interface. In a different context, the get script
command of Aladin can automatically associate
a set of keywords (that can be entered in an arbitrary
order) with the server query vocabulary
(see http://aladin.u-strasbg.fr/java/FAQ.htx#ToC99).
It is along these lines that I designed
the IEAD search engine for the Hubble Space
Telescope archive.

Ideally, the "perfect" interface should be very 
intuitive to use and should require little or no explanation;
still, it should be powerful enough to
allow complex queries. The interface should use a 
simple syntax, or should accept queries formulated 
in a natural language. Finally, the user should be 
able to profit from the archive without knowing in 
detail the structure of the database. 

Natural language user interfaces are generally
difficult to implement and represent a very active
field of research in computer science (e.g. Androutsopoulos
et al. 1995; Popescu et al. 2003). In particular,
a critical task in a natural language interface
is entity identification, i.e. the classification
of the various terms present in a query. Fortunately,
astronomical data archives are very favorable
in this respect because many query terms can
be uniquely associated with a particular data field,
a fact that makes it possible to have unambiguous
and still very simple queries. For example, quantities
such as instrument names, camera names,
filters, and optical element types (such as "filter" or
"grism" or "prism") can assume only a fixed set
of single-word values. Slightly more complex values,
such as principal investigator (PI) names or
data type names (such as "imaging" or "2D spectroscopy"),
still assume values from a limited set of
single or multi-word values. Real values, such as
astronomical coordinates, search radii, exposure
times, exposure dates, or observing wavelengths,
can usually be easily disentangled from one another
by their format (say hh:mm:ss for Right Ascension
vs. ±dd:mm:ss for Declination) or by
their units (30arcsec is probably a search radius,
while 2h is likely an exposure time). Finally, everything
else, i.e. everything that is not recognized
as one of the fields mentioned above, has to
be a target name, and this assumption can then be
verified using the SIMBAD (Wenger et al. 2000)
or NED (Helou et al. 1991) name resolvers.

In this simple approach, the string (queries are 
case insensitive) 

acs m42 f775w 

is immediately understood to be a query for all
instrument=acs data taken with the filter=f775w
around the (SIMBAD resolved) coordinates
of m42. This simple example immediately highlights
one of the main advantages of the IEAD
system over the other query interfaces commonly
found for astronomical databases: a level of abstraction
is removed, and the user is free to express
the query in a much more natural way. This
unique feature makes the use of the database more
direct, and ultimately makes database research
much easier by bringing the query language closer
to the astronomer. Note that although the use
of automatic entity identification is not entirely
new in the astronomical context (cf. the aforementioned
NASA ADS query interface and the Aladin
get script command), IEAD pushes this concept
much further: now entire database queries, possibly
composed of more than twenty different kinds
of entities, are automatically parsed. As an additional
bonus compared to standard archive interfaces,
the user does not need to look for the right quantities
and enter different values in the correct fields,
but rather can mix together all values and still
obtain sensible results.

As a second example, the string 

stis OII imaging planetary nebula

is interpreted as a query for all instrument=stis
data_type=imaging data taken with the filter=OII
for description="planetary nebula". Finally,
the example

nic3 2d spectroscopy hdf-n thompson >30min

can be used to select all camera=nic3 data_type="2d
spectroscopy" observations made by pi=thompson
around the target=HDF-N, with exptime>30min.
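
The entity identification described in this section can be sketched in a few lines. The example below is only an illustration under stated assumptions: it is written in Python for brevity (the production code is in JavaScript), the vocabularies are tiny hypothetical subsets of the real ones, the regular expressions and unit spellings are guesses at plausible formats, and every angle with a unit is simply labeled as a radius.

```python
import re

# Hypothetical, heavily abridged vocabularies; the real interface obtains
# the allowed values from the archive itself.
INSTRUMENTS = {"acs", "stis", "wfpc2", "wfc3"}
CAMERAS = {"nic1", "nic2", "nic3"}
FILTERS = {"f775w", "f555w", "oii"}

# Recognisers are tried in order; the first one that matches wins.
RECOGNISERS = [
    ("instrument", lambda w: w in INSTRUMENTS),
    ("camera", lambda w: w in CAMERAS),
    ("filter", lambda w: w in FILTERS),
    ("ra", lambda w: re.fullmatch(r"\d{1,2}:\d{2}:\d{2}(\.\d+)?", w)),
    ("dec", lambda w: re.fullmatch(r"[+-]\d{1,2}:\d{2}:\d{2}(\.\d+)?", w)),
    ("radius", lambda w: re.fullmatch(r"\d+(\.\d+)?(arcsec|arcmin|deg)", w)),
    ("exptime", lambda w: re.fullmatch(r"\d+(\.\d+)?(s|min|h)", w)),
]


def classify(word):
    """Return the keyword guessed for a single query word."""
    word = word.lower()  # queries are case insensitive
    for keyword, matches in RECOGNISERS:
        if matches(word):
            return keyword
    # Fallback: anything unrecognised is assumed to be a target name,
    # to be verified later with a name resolver.
    return "target"


print([(w, classify(w)) for w in "acs m42 f775w".split()])
```

Trying the recognisers in a fixed order keeps the logic trivial; the price is that the order itself becomes part of the language definition.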

3. User aids 

Since the IEAD query interface is based on
JavaScript, it runs entirely on the user's computer;
additionally, the parser is custom written, and is
therefore fast enough to parse complex queries in
real time. This makes it possible to complement
the interface with two user aids: an automatic display
of the parsing result, and a smart autocompletion
of the query.

The automatic display of the parsing result
shows not only how every word entered is interpreted,
but also the way that the various constraints
are linked together. For example, for
the first example discussed above the interface
shows

(inst=acs and (target=m42 [ok,HII] and
box=600arcsec) and filter=f775w)

This text provides a wealth of information. First,
it is obvious that acs is taken to be an instrument,
f775w a filter, and m42 a target name. Additionally,
the target has been correctly resolved and is
an HII region ([ok,HII]); note that the resolver
(SIMBAD) and the meaning of the HII code (HII
ionized region) are visible if the user leaves the cursor
over the HII text. All terms are combined using
the and boolean operator, and the coordinate
search is performed around the target name with
the default search box, 10 arcmin.

In case of a parsing error the system is also able
to immediately show a descriptive error message
and to indicate where the error took place. Similarly,
the system also informs the user in real time
when a target cannot be resolved: this is a non-blocking
error (in the sense that the user can still
type text in the field), since the problem might simply
be that the target name is still incomplete; however, the
query will not be executed if the target cannot be
resolved.

Additionally, when the user starts filling in the
query field, the system shows all possible completions.
The autocompletion is shown automatically
as soon as the query text ends with an
incomplete word (i.e., no completion is shown if
the query text is empty or if its last character is
a space). The autocompletion uses the complete
parsing engine, and is therefore fully context sensitive:
at any given moment the system knows exactly
what the possible completion words are, and
can show them to the user.
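
The triggering rule for the autocompletion can be illustrated with a minimal Python sketch. It assumes a flat, hypothetical vocabulary; the actual system instead asks the parsing engine which words are valid at the current position, which is what makes it context sensitive.

```python
# Hypothetical vocabulary; the real interface derives the candidate words
# from the state of the parser rather than from a flat list.
VOCABULARY = ["acs", "stis", "wfc3", "wfpc2", "f775w", "f555w", "imaging"]


def completions(query, vocabulary=VOCABULARY):
    """Return the completions for the last, incomplete word of the query."""
    # Nothing is proposed for an empty query or after a trailing space.
    if not query or query[-1].isspace():
        return []
    prefix = query.split()[-1].lower()
    return sorted(w for w in vocabulary if w.startswith(prefix) and w != prefix)


print(completions("m42 wf"))
```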

Finally, it is worth mentioning that both the
ST-ECF and the CADC HST archive pages embed
the IEAD system inside a traditional query form,
composed of different entries for each field. In
these pages each entry from the traditional query
form can be dragged and dropped into the IEAD
one-line interface. This should help users to get acquainted
with the new interface and to enter
constraints for uncommon keywords that are harder
to remember.

4. General syntax

The general syntax accepted by IEAD is represented
in Fig. 1 and will be explained in detail in
this section. It is based on qualified terms, i.e. on
a combination keyword-operator-value such as

instrument=ACS

Both the keyword and the operator can be omitted:
if the keyword is omitted, it is inferred from
the value (in the example above, ACS is obviously
an instrument); if the operator is omitted, the
"equal" (=) is assumed, unless the value is preceded
by a minus sign, in which case the assumed
operator is "not equal" (!=). Therefore,
all these queries are equivalent to the one written
above

instrument ACS        =ACS        ACS

while -ACS is interpreted as instrument!=ACS.
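
The defaults for omitted keywords and operators can be made concrete with a short Python sketch. The function name parse_term, the keyword subset, and the infer_keyword callback are all hypothetical; note also that this simplification does not handle signed numerical values such as a negative Declination, which would need to be recognised before the minus-sign rule is applied.

```python
KEYWORDS = {"instrument", "filter", "target"}   # hypothetical subset
OPERATORS = ("!=", ">=", "<=", "=", ">", "<")   # longest operators first


def parse_term(text, infer_keyword):
    """Split a term into (keyword, operator, value), filling in defaults."""
    text = text.strip()
    keyword, operator, value = None, None, text
    for op in OPERATORS:
        if op in text:
            # Explicit operator: the keyword (possibly empty) precedes it.
            left, value = text.split(op, 1)
            keyword, operator = left.strip() or None, op
            break
    else:
        # No operator: an optional leading keyword is separated by a space.
        first, _, rest = text.partition(" ")
        if first.lower() in KEYWORDS and rest.strip():
            keyword, value = first, rest
    value = value.strip()
    if operator is None:
        # A leading minus sign turns the implicit "=" into "!=".
        if value.startswith("-"):
            operator, value = "!=", value[1:]
        else:
            operator = "="
    if keyword is None:
        keyword = infer_keyword(value)
    return keyword.lower(), operator, value


# Stand-in for the automatic identification discussed in Sect. 4.1.
guess = lambda value: "instrument"
for term in ("instrument=ACS", "instrument ACS", "=ACS", "ACS", "-ACS"):
    print(term, "->", parse_term(term, guess))
```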

4.1. Automatic keyword identification 

As discussed above, the entity identification is 
a key feature of the IEAD system and a critical
step toward a natural language query. This task is 
performed by attaching to each word (or group of 
words) a keyword and operator, when these have 
not been explicitly assigned. 



The automatic identification of the keyword 
uses a simple but effective scheme. The value is 
compared with a set of known values or formats, 
and the first matching keyword is used. Keywords
are sorted so that the most used ones, or the
most restrictive ones (i.e. the ones that give a
match only in rather specific cases), are at the beginning,
so that in the large majority of cases the
automatic identification is successful (see Table 1
for a list of accepted keywords). Note, in particular,
that the target name is the last keyword tried:
target names can have many different formats, and
additionally it is very time-consuming to verify whether
a phrase can be taken to be a target name. As
mentioned in Sect. 3, the user can verify in real
time the identification performed and, if necessary
(in the rare cases where the identification is inappropriate
or where the keyword cannot be automatically
recognized; see last column of Table 1),
specify the desired keyword.

4.2. Formats for values 

A second key feature of the IEAD system is
the treatment of values and their identification.
This part of the parsing process influences the automatic
keyword identification and therefore has
been designed with particular care. The kinds of
values currently supported include:

Word. A single word can be used to specify, for
example, an instrument, a filter, or a dataset
ID. Typically, these values can be assigned
an automatic keyword for quantities
that can only assume a value among a limited
list of values (such as the instruments)
or that have a specific format (such as the
dataset ID).

Phrase. A few values are composed of several
words (cf. for example the data_type keyword
in the examples of Sect. 2). These can be
just placed one after the other, or can be
enclosed within single or double quotes to
avoid ambiguities.

Number. Integer or floating point values. Floating
point numbers can use the scientific notation
d.ddde±dd, i.e. with the letter e as
separator between the mantissa and the exponent.

Number with unit. A number immediately followed
by a unit without spaces is used for
many terms, such as exposure time or pixel
scale. Different units are allowed for the
same quantity (for example seconds, minutes,
hours for the exposure time; see Table 2).

Angle. Coordinates are entered in the format
±ddd:mm:ss (hh:mm:ss for Right Ascension),
with optional decimals for the
(arc)seconds; when a keyword is specified,
one can alternatively use fractional degrees
or minutes. Without keywords, coordinates
are always taken to be equatorial, and in
particular coordinates without sign are interpreted
as Right Ascension, and coordinates
with sign are interpreted as Declination.

Date. Dates must always be entered in the format
yyyy-mm-dd.

Range. A range can be entered using the format
min..max, where either min or max can be
omitted (but not both). Ranges are accepted
for all cases where a numerical value is valid,
including angles, dates, numbers, and numbers
with unit (in this case, a unit entered
only for one of the two extremes is applied
to the other extreme too). A range can only
be used with the equality (=) or inequality
(!=) operator, and is then converted internally
into expressions such as keyword>=min
and keyword<=max for the equality and keyword<min
or keyword>max for the inequality
operator.
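
The treatment of numbers with unit and of ranges can be sketched as follows. This Python example is a simplified illustration only: it covers a single quantity (the exposure time, converted to seconds), the unit spellings s, min, and h are assumptions, and the function names are hypothetical.

```python
import re

# Conversion factors to seconds for the exposure time; the unit
# spellings used here are assumptions.
SECONDS = {"s": 1.0, "min": 60.0, "h": 3600.0}
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)([a-z]*)", re.IGNORECASE)


def split_quantity(text):
    """Split '30min' into (30.0, 'min'); the unit may be empty."""
    m = QUANTITY.fullmatch(text)
    if m is None:
        raise ValueError(f"not a number with unit: {text!r}")
    return float(m.group(1)), m.group(2).lower()


def expand_range(keyword, text, negate=False):
    """Translate 'min..max' into explicit comparisons on `keyword`."""
    low, sep, high = text.partition("..")
    if not sep or not (low or high):
        raise ValueError("expected min..max with at least one extreme")
    parts = [split_quantity(p) if p else None for p in (low, high)]
    # A unit given for only one extreme is applied to the other one too.
    units = [p[1] for p in parts if p and p[1]]
    if not units:
        raise ValueError("a unit is required on at least one extreme")
    lo, hi = [None if p is None else p[0] * SECONDS[p[1] or units[0]]
              for p in parts]
    if negate:   # inequality operator: outside the range
        tests, joiner = [(lo, "<"), (hi, ">")], " or "
    else:        # equality operator: inside the range
        tests, joiner = [(lo, ">="), (hi, "<=")], " and "
    return joiner.join(f"{keyword}{op}{v:g}" for v, op in tests if v is not None)


print(expand_range("exptime", "30..60min"))
```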

4.3. Special cases 

A few terms are interpreted in a special way
to accommodate particular cases. Many of these
are especially important because the interface performs
for them a query expansion, i.e. it extends
the meaning of the human-entered values to adapt
them to the database.

Target name. All terms that cannot be automatically
associated with any keyword are
taken to be target names. These are resolved
in real time through a call to the Sesame
name resolver, which by default queries in
sequence the SIMBAD, NED, and VizieR
(Ochsenbein et al. 2000) databases. The
result of the Sesame check is shown to the
user as soon as it is available,
typically within a couple of seconds. The results
are cached, so the same target is never
submitted again to Sesame within the same
session.

Coordinate pair. Two close coordinate terms
(Right Ascension and Declination, or Galactic
latitude and longitude, or Ecliptic latitude
and longitude) are interpreted as a
coordinate pair. This is important when
parsing positional constraints, since these
are almost always taken with an implicit or
explicit "fuzziness" (see below "Search radius").

Search radius. An isolated angle quantity (i.e.,
a number followed by an angular unit, such
as 2arcsec) is taken to be an indication
of the angular resolution of the observation.
However, when the same quantity appears
close to a coordinate term (such as 12:32:45,
which is parsed as a Right Ascension coordinate),
to a coordinate pair (see above),
or to a target name, it is taken to be a
search radius around the coordinate, the
coordinate pair, or the target. When not
specified, all coordinates with (implicit or
explicit) equality or inequality operator are
taken to have a search radius of 10 arcmin.
Note also that a search radius works differently
depending on whether it is applied to a point on
the celestial sphere (a coordinate pair, see
above) or to a single coordinate (in this case
the search "radius" does not identify a disk
but rather a stripe in the sky).

PI names. Since different people have different
habits for writing names, the system processes
PI names so that the first name can be
entered before or after the last name (more
complicated cases where a middle name is
present, or where the last name is composed
of several words, are also handled). Internally
the entered PI name is mapped into
the format used by the database. Additionally,
for PIs the dot is equivalent to the *
wildcard (see Sect. 4.4 below).

Spectral range. This is a special kind of number
with unit, where the value entered is interpreted
by requiring that the specific wavelength
is within the filter sensitivity of the
dataset (or that the specified wavelength range
has a non-vanishing intersection with the
filter sensitivity). Additionally, the spectral
range can also be entered using standard
Johnson-Cousins filter names, which
are approximately translated into the corresponding
pivot wavelengths. This feature,
together with the capability of the system to
recognize simple phrases for data types (such
as imaging or 2d spectroscopy), makes it
possible to use the interface without specific
knowledge of the HST instrument capabilities.
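
The query expansion for the spectral range amounts to an interval test, which the following Python sketch illustrates. The filter names and bandpass limits below are placeholders invented for the example, not actual HST filter data, and the function names are hypothetical.

```python
# Placeholder bandpasses (lower and upper limit, in nanometres).
# These numbers are purely illustrative and are NOT real filter data.
FILTER_BANDS = {"filter_a": (400.0, 500.0), "filter_b": (700.0, 850.0)}


def band_matches(band, low, high=None):
    """Test a single wavelength, or a range low..high, against a band."""
    band_low, band_high = band
    if high is None:
        # Single wavelength: it must fall within the filter sensitivity.
        return band_low <= low <= band_high
    # Range: the intersection with the band must be non-vanishing,
    # hence the strict inequalities.
    return low < band_high and high > band_low


def matching_filters(low, high=None, bands=FILTER_BANDS):
    """Names of all filters compatible with the spectral constraint."""
    return sorted(n for n, b in bands.items() if band_matches(b, low, high))


print(matching_filters(450.0))         # single wavelength
print(matching_filters(480.0, 720.0))  # wavelength range
```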

4.4. Wildcards and anchors

IEAD also accepts two wildcards for text (non-
numerical) values: the asterisk (*), matching any 
text, and the question mark (?), matching a single 
character. These are the same wildcards used in 
globbing by UNIX shells, and therefore should be 
familiar to many users. 

Additionally, the system also accepts the caret
(^) to match the beginning of a value, and the dollar
sign ($) to match the end of a value. Therefore,
to force a keyword to have exactly a given value
one should use both anchors, as in title=^star$.
Note that the simple title=star would also match
observations with titles such as "A peculiar
star in a nebula".

Both wildcards and anchors can be escaped using
the backslash, as in \* or \^.
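
One possible way to implement these matching rules is to translate each value into a regular expression, as in the Python sketch below. This is an assumption about a convenient technique, not a description of how IEAD or the archive server actually evaluates the patterns; the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile a value using * ? ^ $ and backslash escapes into a regex."""
    out, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            # An escaped character is always taken literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if c == "*":
            out.append(".*")    # any text
        elif c == "?":
            out.append(".")     # a single character
        elif c == "^" and i == 0:
            out.append("^")     # anchor at the beginning of the value
        elif c == "$" and i == len(pattern) - 1:
            out.append("$")     # anchor at the end of the value
        else:
            out.append(re.escape(c))
        i += 1
    return re.compile("".join(out), re.IGNORECASE)


def matches(pattern, value):
    # re.search gives substring semantics unless anchors are present.
    return pattern_to_regex(pattern).search(value) is not None


title = "A peculiar star in a nebula"
print(matches("star", title), matches("^star$", title))
```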

4.5. Multiple constraints 

Multiple constraints can be simply written one 
after the other. Again, the specific framework 
of our query language, astronomical databases, 
makes it simple to define rules that correspond 
to the ones of a natural language: 

• All terms with (implicit or explicit) equal- 
ity operator that share the same (implicit or 
explicit) keyword are combined with an or 
boolean operator; 

• Other terms are combined with an and 
boolean operator. 



Note that in different contexts this particular
problem, i.e. the so-called conjunction and disjunction
ambiguity, is difficult to solve because
it requires knowledge of the relationships among
the various entities, while in the astronomical context
the solution is obvious: all entities (for example,
instruments, cameras, filters) are mutually exclusive,
in the sense that an observation can only
use one of them at a time.

These rules make sure that the simple query 

m42 acs wfc3

is interpreted as a search for observations around 
M42 carried out either with the ACS or with the 
WFC3 instruments, while 

m42 -acs -wfc3 

is interpreted as a search for observations around 
M42 carried out neither with the ACS nor with 
the WFC3 instruments. 
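
The two combination rules above can be written down directly. The following Python sketch assumes that each constraint has already been reduced to a (keyword, operator, value) triple; the function name combine and the textual output format are hypothetical.

```python
def combine(terms):
    """Combine (keyword, operator, value) triples into one expression."""
    clauses = []      # top-level clauses, joined with "and"
    equalities = {}   # keyword -> clause collecting its "=" terms
    for keyword, operator, value in terms:
        text = f"{keyword}{operator}{value}"
        if operator == "=":
            # Equality terms sharing a keyword end up in the same clause,
            # where they are joined with "or".
            if keyword not in equalities:
                equalities[keyword] = []
                clauses.append(equalities[keyword])
            equalities[keyword].append(text)
        else:
            clauses.append([text])
    parts = [c[0] if len(c) == 1 else "(" + " or ".join(c) + ")"
             for c in clauses]
    return " and ".join(parts)


print(combine([("target", "=", "m42"),
               ("instrument", "=", "acs"),
               ("instrument", "=", "wfc3")]))
```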

4.6. Full boolean queries 

The combination of the automatic identification
and of the simple syntax for combined constraints
nicely handles the large majority of queries. However,
in specific cases one might desire or need to
perform more specific queries involving combinations
of parameters.

In these situations it is possible to include the
boolean operators and, or, and not in the query
and use them as one would normally do in any
programming language. For example, the string

(acs and grism) or (stis and prism) 

might represent a sensible query. It is possible to 
mix multiple constraints without boolean operators
with queries involving boolean operators: for ex- 
ample, the query 

acs grism or stis prism 

would be interpreted exactly as the example above 
(note that the implicit boolean operators inserted 
between multiple constraints have a higher prior- 
ity than explicit boolean operators). 
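
The priority rule just stated can be illustrated with a small recursive-descent parser, written here in Python. The grammar is an assumption made for the example: in particular, not is placed above the implicit grouping, and the relative priority of and and or follows the usual programming convention. Juxtaposed constraints are collected in an "implicit" node, to be resolved with the rules of Sect. 4.5.

```python
import re


def parse(query):
    """Parse a query into nested tuples; implicit groups bind tightest."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        node = parse_and()
        while peek() == "or":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == "and":
            take()
            node = ("and", node, parse_not())
        return node

    def parse_not():
        if peek() == "not":
            take()
            return ("not", parse_not())
        return parse_implicit()

    def parse_implicit():
        # Juxtaposed constraints are grouped before any explicit operator.
        items = []
        while peek() not in (None, ")", "and", "or", "not"):
            items.append(parse_atom())
        if not items:
            raise SyntaxError("constraint expected")
        return items[0] if len(items) == 1 else ("implicit", *items)

    def parse_atom():
        token = take()
        if token == "(":
            node = parse_or()
            if peek() != ")":
                raise SyntaxError("missing closing parenthesis")
            take()
            return node
        return token

    tree = parse_or()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("acs grism or stis prism"))
```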

5. Implementation 

As mentioned earlier, IEAD is entirely written
in JavaScript, so that the code runs directly in
the user's browser and can perform truly interactive
actions such as the automatic display of the
parsing result and the context-sensitive autocompletion.
Indeed, the choice of the programming
language for the final implementation has been
mainly driven by the possibility to display interactive
messages to the user (the original prototype
of the interface was written in Python).

The code is object oriented and is built over the
concept of a term, i.e. a combination of keyword-operator-value.
It defines a different kind of object
for each different type of term: number (integer or
float), angle, date, word, phrase, unit, flag, wavelength,
pi (cf. Table 1). All term classes are organized
hierarchically with a common ancestor.

Each object has a constructor, which defines
the keyword (and associated aliases) of the term,
the associated field in the database, and type-dependent
options (for example, for the angle the
allowed range, whether the sign is mandatory, and
a flag to indicate if the field refers to right ascension
or not). Additionally, all term classes define
a number of common members to deal with several
common actions, such as the verification of
the input, the parsing, the error handling, and the
autocompletion. Finally, a member function takes
care of the translation of the term into SQL code
or into a human readable string.

The program also defines meta-terms, i.e.
classes to modify the behaviour of other terms.
For example there is a meta-term that modifies
other terms to make the keyword and/or the operator
of a term compulsory; another meta-term
makes other terms optional (in the sense that if
not present, then a default value is used: cf. the
use of the search box, explained above). However,
the most important meta-term is the one that generates
ranges that use a pair of dots as separator.

Finally, all terms are combined together into a 
parser for an expression that can involve boolean 
operators and parentheses: again this process is 
realized within a special class with a structure similar
to that of a term.
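
The class organization described above might look roughly as follows. This Python sketch is not the actual IEAD code (which is JavaScript): the class names, method names, and database field names are all hypothetical, and only two term types are shown.

```python
class Term:
    """Common ancestor: a keyword-operator-value combination."""

    def __init__(self, keyword, field, aliases=()):
        self.keyword = keyword                  # name shown to the user
        self.field = field                      # database column (hypothetical)
        self.aliases = set(aliases) | {keyword}

    def accepts(self, value):
        """Verification of the input; defined by each term type."""
        raise NotImplementedError

    def quote(self, value):
        # Default: a quoted SQL string literal.
        return "'" + str(value).replace("'", "''") + "'"

    def to_sql(self, operator, value):
        return f"{self.field} {operator} {self.quote(value)}"

    def to_human(self, operator, value):
        return f"{self.keyword}{operator}{value}"


class WordTerm(Term):
    """A term whose value belongs to a fixed list of words."""

    def __init__(self, keyword, field, allowed, aliases=()):
        super().__init__(keyword, field, aliases)
        self.allowed = {w.lower() for w in allowed}

    def accepts(self, value):
        return value.lower() in self.allowed


class NumberTerm(Term):
    """A term with a numerical value."""

    def accepts(self, value):
        try:
            float(value)
        except ValueError:
            return False
        return True

    def quote(self, value):
        return repr(float(value))   # numbers are not quoted in SQL


inst = WordTerm("instrument", "instrument_name", ["ACS", "STIS"], aliases=["inst"])
print(inst.to_human("=", "acs"), "|", inst.to_sql("=", "ACS"))
```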

In summary, the code uses the following simple 
scheme: 

1. The query is initially handled by a simple
lexer that splits the string into tokens, i.e.
words that have an individual meaning in
the language used by IEAD.

2. The tokens are passed to the expression
parser, which analyses them in order.

3. The parser tries to parse the various terms
by trying, in sequence, all term types that
define an expression. The first matching
term is used for the rest of the parsing. The
last possible term tested is the target one,
which by default accepts all tokens (unless
a different keyword is specified). When the
target term is used, a query to the Sesame
name resolver is also started in parallel.

4. The preceding point is repeated until all tokens
are consumed.

5. The code uses the parser to translate the
query into a human readable string that is
shown to the user.

6. When the query is finally executed, the
parser is called again to generate
SQL code. This code is placed directly in a
special field in an HTML form, and is then
passed to the server, which uses it to directly
interrogate the database with very little manipulation.
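
The first step of the scheme above, the lexer, can be sketched in Python with a single regular expression. The token categories (parenthesis, quoted phrase, word) are an assumption made for the example; the actual set of tokens recognised by IEAD may differ.

```python
import re

TOKEN = re.compile(r"""
    \s*(?:
        (?P<paren>[()])                  # grouping
      | (?P<quoted>"[^"]*"|'[^']*')      # quoted phrase, kept as one token
      | (?P<word>[^\s()"']+)             # anything else up to a delimiter
    )""", re.VERBOSE)


def lex(query):
    """Split a query string into (kind, text) tokens."""
    tokens, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            # Reporting the position lets the interface show where the
            # error took place (here: an unterminated quote).
            raise SyntaxError(f"cannot tokenize at position {pos}")
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "quoted":
            text = text[1:-1]   # drop the surrounding quotes
        tokens.append((kind, text))
        pos = m.end()
    return tokens


print(lex('acs "planetary nebula" (m42)'))
```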

5.1. Internal database 

The IEAD code uses a simple internal JSON
database to save the values for the various terms 
that have a fixed set of permitted values, such as 
the instrument names or the filter names. Periodi- 
cally this database is updated to reflect the status 
of the full HST archive: this process is particularly 
important for values such as the PI names, which 
are likely to change often (basically each time ob- 
servations from a new PI are carried out). 

This simple process is handled by a straightforward
Python script. The only interesting point
to note here is that the script performs a specific
analysis for the description term. Since (almost)
each proposal has a different description, often
containing several words, it would be impractical
and probably not very useful to use these values as
they are for the description term. Rather, the
Python script analyses the various descriptions,
and extracts from them the most common phrases.
These are automatically recognized and also used
for autocompletion in the interface. Of course,
one is still free to query for a particular description
using a fully qualified term, as in
description=phrase.
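
The extraction of common phrases is not specified further here; one simple possibility is to count word n-grams across all descriptions, as in the following Python sketch. The function name, the n-gram approach, and the thresholds are assumptions, and the sample descriptions are invented for the example.

```python
import re
from collections import Counter


def common_phrases(descriptions, max_words=3, min_count=2):
    """Return phrases of up to max_words words found in several descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9]+", text.lower())
        seen = set()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)   # each phrase counts once per description
    return [p for p, c in counts.most_common() if c >= min_count]


# Invented sample descriptions, for illustration only.
sample = ["Planetary nebula survey", "A young planetary nebula", "Globular cluster"]
print(common_phrases(sample))
```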

6. Future prospects 

The research presented here is the first step to- 
ward more advanced, efficient, and intuitive query 
interfaces for astronomical databases. However, 
this should not be considered the definitive, op- 
timal solution. Instead, given the role that as- 
tronomical databases will play in the future as- 
tronomical research, it is critical to develop even 
more advanced interfaces to fulfill the needs of the 
users and to stimulate different uses of the astro- 
nomical databases. The interface discussed in this 
paper could be improved in various ways, and it is 
useful to briefly consider future research directions 
here. 

The interface could provide instant previews of 
the entered query, without forcing the user to press 
the "Search" button each time. This would allow 
one to explore in real time the archive and as such 
would represent a major step forward. Unfortunately,
at present the structure and the query
performance of most archives do not allow such
a search to be performed in real time, and major
enhancements in the database software and servers
would be needed for this task.

Improvements could also be made in the autocompletion
scheme. So far, the autocompletion
system presents a truncated (or, optionally, the
full) list of possible completions for a query, but
it by no means tries to perform a "smart" job.
First, the list produced is presented in alphabet- 
ical order, which is not the most useful one (it 
would be much more sensible to present the list 
of completions by sorting them by "popularity," 
i.e. frequency of use). Second, the autocomple- 
tion is only grammatically context-sensitive, not 
semantically: the proposed completions are likely 
to contain possibilities that would be considered 
obviously wrong by an experienced user (for ex- 
ample, the interface would include the WFPC2 in- 
strument as a possible completion of "WF" even if 
the user already selected the grism G280, which is 
not available for this instrument). This improve- 
ment, however, appears to be rather complicated 
to develop. 

As mentioned already, the new interface implements
many of the tasks needed in a natural
language query, such as entity identification,
query expansion, and conjunction-disjunction dis- 
ambiguation. The ad-hoc implementation is com- 
putationally efficient, but has also various limi- 
tations: the set of keywords must be provided 
to the system together with a way to discover 
from the database the set of allowed values; and 
the query language is kept at a very basic level. 
There are various possible solutions for these is- 
sues. The need for a specific configuration of the 
system (keywords and predefined values) could be 
removed through the use of VO registries (Benson 
et al. 2009): this would make it possible to port 
the interface to all VO-compliant archives, possi- 
bly even without the explicit collaboration from 
the maintainers. Regarding possible extensions of 
the query language, it would be very interesting 
to investigate modern techniques used in natural 
language research, involving statistical inference 
and machine learning. However, one should also 
be aware of the possible risks of a full natural 
language query: on the one hand, the lack of a 
well defined grammar, and therefore of the set of 
possible qiierics, might keep the users away from 
queries that are considered too complex; on the 
other hand, many users would expect the system 
to be "intelligent" and might request queries that 
are outside of its capabilities. 

7. Conclusions 

In this paper I have presented lEAD, a new one-line
query interface for astronomical science databases.
The major advantages of this interface over standard
ones are:

• Queries are performed on a single line. This
makes query pages very clean, avoids the
clumsiness often present in astronomical
archive interfaces, and lets the user concentrate
on the query (instead of on finding the right
field for each constraint).

• The interface uses a simple syntax, designed 
to minimize the quantity of text the user has 
to enter and to be close to a natural lan- 
guage. 

• Queries involving complex combinations of
parameters, boolean operators, and parentheses
are possible.






• The interface provides immediate feedback
on the parsing of the entered string and on
the resolution of astronomical object names.
Additionally, it has context-sensitive
autocompletion.

• The code can easily be integrated with any
SQL-compliant astronomical database, and is
extensible and usable for different telescopes
or observatories.

I am grateful to F. Stoehr and J. Haase for help- 
ful and constructive interactions, and to R. Fos- 
bury and J.R. Walsh for encouraging and support- 
ing this project. 

[Figure 1: syntax diagram not reproduced. Recoverable node labels: words, or, not, numeric value, angle, range, complex value.]

Fig. 1. Simplified syntax diagram for lEAD. Non-terminal nodes are indicated using square boxes, while
terminal ones have rounded corners.

Table 1: List of currently accepted keywords and aliases.

Keyword        Aliases                                  Type        Units           Automatic
-------------  ---------------------------------------  ----------  --------------  ---------
inst           instr, instrument                        word        -               yes
camera         detector                                 word        -               yes
filter         opt_elem, optical_element                word        -               yes
opt_elem_type  optical_element_type                     word        -               yes
data_type      -                                        phrase      -               yes
date           obsdate, observation_date                date        -               yes
released       release_date, observation_date           date        -               no
ra             -                                        angle (a)   -               yes
dec            decl, declination                        angle (a)   -               yes
glon           galactic_longitude, gal_lon              angle (a)   -               yes
glat           galactic_latitude, gal_lat               angle (a)   -               yes
elon           ecliptic_longitude, ecl_lon              angle (a)   -               yes
elat           ecliptic_latitude, ecl_lat               angle (a)   -               yes
box            radius, r, within, search_box            unit        angle           yes (b)
hst_target     hst_target_name, hst_name                phrase      -               no
description    descr, target_description,               phrase      -               yes (c)
               descript, targetdescription
exptime        exposure, exposure_time                  unit        time            yes
prop           proposal, proposalid,                    integer     -               yes
               proposal_id, prop_id
pi             pi_name, piname, principal_investigator  pi          -               yes
dataset        dataset_name, data_set_name              word        -               yes
title          proposal_title, prop_title               phrase      -               no
resolution     spatial_resolution                       unit        arcsec          yes (b)
scale          pixel_scale, pixel                       unit        arcsec          no
slew           moving, moving_object, moving_target     flag        -               no
wavelength     wave, lambda                             wavelength  wavelength (d)  yes
bandwidth      -                                        unit        wavelength      no
spec_res       spectral_resolution                      unit        wavelength      no
res_power      resolving_power, respower                float       -               yes
time_start     start                                    date        -               no
time_end       end                                      date        -               no
members        no_members                               integer     -               no
mode           photon_mode                              phrase      -               no
extension      science_extension                        phrase      -               yes

(a) Coordinate pairs.
(b) The box term is valid only near a coordinate, and in that case takes priority over resolution. Therefore, resolution is
    automatic only when there is no nearby coordinate.
(c) The description is automatically recognized for the most common description phrases, as deduced from a periodic analysis of
    the proposal database; typical examples include "star forming region" or "cluster of galaxies".
(d) In addition to wavelength units, the following astronomical bands are also recognized: ultraviolet, optical, infrared, u, b,
    g, v, r, i, z, j, h, and k.
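
The alias mechanism of Table 1 amounts to a many-to-one lookup from user-typed names to canonical keywords. The following JavaScript sketch shows one way such a lookup could be built. It is an illustration only, not the actual lEAD code: the names KEYWORD_ALIASES and resolveKeyword are hypothetical, only a few rows of the table are included, and case-insensitive matching is an assumption.

```javascript
// A few rows of Table 1: canonical keyword -> aliases (illustrative subset).
const KEYWORD_ALIASES = {
  inst: ["instr", "instrument"],
  camera: ["detector"],
  dec: ["decl", "declination"],
  exptime: ["exposure"],
  wavelength: ["wave", "lambda"],
};

// Build the reverse lookup once: every keyword and alias maps to its keyword.
const ALIAS_TO_KEYWORD = new Map();
for (const [keyword, aliases] of Object.entries(KEYWORD_ALIASES)) {
  ALIAS_TO_KEYWORD.set(keyword, keyword);
  for (const alias of aliases) {
    ALIAS_TO_KEYWORD.set(alias, keyword);
  }
}

// Resolve a user-typed name to its canonical keyword, or null if unknown.
// Case-insensitive matching is an assumption made for this sketch.
function resolveKeyword(name) {
  const keyword = ALIAS_TO_KEYWORD.get(name.toLowerCase());
  return keyword === undefined ? null : keyword;
}
```

For example, resolveKeyword("Instrument") returns "inst" and resolveKeyword("lambda") returns "wavelength", while an unknown name returns null.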

Table 2: List of unit types used for different keywords (cf. Table 1).

Unit type   Default  Units                          Factor
----------  -------  -----------------------------  ------
Angle       arcsec   arcsec, arcsecond, arcseconds  1/3600
                     arcmin, arcminute, arcminutes  1/60
                     deg, degree, degrees           1
Time        s        s, sec, second, seconds        1
                     m, min, minute, minutes        60
                     h, hour, hours                 3600
Wavelength  nm       nm, nanometer, nanometers      1
                     a, ang, angstrom, angstroms    1/10
                     um, micron, microns            1000
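
The factors of Table 2 can be read as multipliers that bring a value to a common base unit (degrees, seconds, and nanometers, respectively), with the default unit used when the user gives none. This reading, and the names UNIT_TYPES and toBaseUnits, are assumptions made for the following JavaScript sketch; it is not taken from the lEAD source.

```javascript
// Table 2 as data: for each unit type, the default unit and the accepted
// unit names with their multiplicative factors.
const UNIT_TYPES = new Map([
  ["angle", {
    defaultUnit: "arcsec",
    rows: [
      [["arcsec", "arcsecond", "arcseconds"], 1 / 3600],
      [["arcmin", "arcminute", "arcminutes"], 1 / 60],
      [["deg", "degree", "degrees"], 1],
    ],
  }],
  ["time", {
    defaultUnit: "s",
    rows: [
      [["s", "sec", "second", "seconds"], 1],
      [["m", "min", "minute", "minutes"], 60],
      [["h", "hour", "hours"], 3600],
    ],
  }],
  ["wavelength", {
    defaultUnit: "nm",
    rows: [
      [["nm", "nanometer", "nanometers"], 1],
      [["a", "ang", "angstrom", "angstroms"], 1 / 10],
      [["um", "micron", "microns"], 1000],
    ],
  }],
]);

// Convert `value` to the base unit of `unitType`; when `unit` is omitted,
// the default unit of that type is assumed. Unit names are matched
// case-insensitively (an assumption of this sketch).
function toBaseUnits(value, unitType, unit) {
  const spec = UNIT_TYPES.get(unitType);
  if (!spec) throw new Error(`unknown unit type: ${unitType}`);
  const name = (unit === undefined ? spec.defaultUnit : unit).toLowerCase();
  for (const [names, factor] of spec.rows) {
    if (names.includes(name)) return value * factor;
  }
  throw new Error(`unknown ${unitType} unit: ${unit}`);
}
```

Under this reading, toBaseUnits(2, "time", "min") gives 120 seconds and toBaseUnits(5, "wavelength", "um") gives 5000 nm. Note that for angles the default unit (arcsec) differs from the base unit implied by the factors (degrees).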